Tool response quality, turn tracing, agent eval CLI — release 0.7.0 by morganlinton · Pull Request #6 · GetSmallAI/SmallHarness

morganlinton · 2026-06-10T02:12:32Z

Summary

fix: tool response quality for three model-facing edge cases (file_read offset past EOF, list_dir truncation total, grep unparseable lines)
refactor: split the 3,000-line commands/mod.rs into config_cmds, context_cmds, memory, session
feat: turn tracing — /trace view, .sessions/<id>.events.jsonl event log, turn timing footer
feat: agent eval CLI (--eval) with mock-SSE integration tests and an optional nightly CI job
chore: release 0.7.0 (Cargo version, CHANGELOG, README badge)

Each commit compiles and passes cargo test, clippy, and fmt independently.

Merging with a merge commit (not squash) to keep the per-feature commits; v0.7.0 will be tagged on the release commit after merge, which triggers the release workflow and the Homebrew tap update.

🤖 Generated with Claude Code

…t convention

Models trained on Claude Code's Edit tool send file_edit({old_text: "", new_text: <content>}) to create new files. SmallHarness's file_edit previously returned "File not found" immediately, forcing a retry loop (mkdir → touch → file_edit) that wasted 2-3 extra API round-trips.

Now: a single edit with old_text="" on a missing file creates the file (including parent dirs), matching the Claude Code convention exactly. Non-creation cases that hit "not found" or "old_text is empty" now also include "Use file_write to create new files" for faster recovery.

Adds two tests: one for the new creation path, one confirming the non-empty-old_text-on-missing-file error still fires with the hint.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
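The creation path described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual SmallHarness code: the function signature, the `Result<String, String>` return type, and the exact message wording are assumptions; only the behavior (empty old_text on a missing file creates it with parent dirs, other failures carry the file_write hint) comes from the commit message.

```rust
use std::fs;
use std::path::Path;

/// Hypothetical, simplified file_edit. The real tool's signature and
/// messages may differ; this only mirrors the behavior described above.
pub fn file_edit(path: &Path, old_text: &str, new_text: &str) -> Result<String, String> {
    if !path.exists() {
        if old_text.is_empty() {
            // Creation path: empty old_text on a missing file creates it,
            // including any missing parent directories.
            if let Some(parent) = path.parent() {
                if !parent.as_os_str().is_empty() {
                    fs::create_dir_all(parent).map_err(|e| e.to_string())?;
                }
            }
            fs::write(path, new_text).map_err(|e| e.to_string())?;
            return Ok(format!("Created {}", path.display()));
        }
        // Non-empty old_text on a missing file is still an error, with a hint.
        return Err(format!(
            "File not found: {}. Use file_write to create new files.",
            path.display()
        ));
    }
    if old_text.is_empty() {
        // The file exists, so an empty old_text is ambiguous; refuse with the hint.
        return Err("old_text is empty. Use file_write to create new files.".to_string());
    }
    let content = fs::read_to_string(path).map_err(|e| e.to_string())?;
    if !content.contains(old_text) {
        return Err("old_text not found in file".to_string());
    }
    // Replace only the first occurrence (an assumption about edit semantics).
    fs::write(path, content.replacen(old_text, new_text, 1)).map_err(|e| e.to_string())?;
    Ok(format!("Edited {}", path.display()))
}
```

With this shape, a model that sends old_text="" for a new file succeeds in one call, while a genuine mistake (editing a file that does not exist) still fails and points at file_write.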

file_read: offset past EOF now returns a clear error instead of silently returning empty content, which caused models to think files were empty and retry with different offsets.

list_dir: add "total" field to every response so models know the real directory size when truncated (count capped at 500 but total reflects the actual entry count).

grep: switch map → filter_map so unparseable rg output lines (e.g. binary-file notices) are dropped rather than emitted as malformed {content: "..."} objects missing the file and line fields. Also moves .take(100) after filter_map to ensure up to 100 *parseable* matches.

Adds the first test module for grep.rs (6 new tests total across the three files).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
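The three fixes above can be sketched in a few lines each. Everything here is illustrative: the type and function names are hypothetical, the `path:line:content` line format is an assumption about how rg is invoked, and the offset is assumed to be a 0-based line index. Only the `filter_map` then `take` ordering, the `total` field, and the past-EOF error come from the commit message.

```rust
/// Hypothetical match record with the `file`, `line`, and `content` fields.
#[derive(Debug, PartialEq)]
pub struct GrepMatch {
    pub file: String,
    pub line: u32,
    pub content: String,
}

/// Parse one `path:line:content` line; returns None for anything else
/// (for example a "Binary file ... matches" notice).
fn parse_rg_line(raw: &str) -> Option<GrepMatch> {
    let mut parts = raw.splitn(3, ':');
    let file = parts.next()?;
    let line = parts.next()?.parse::<u32>().ok()?;
    let content = parts.next()?;
    if file.is_empty() {
        return None;
    }
    Some(GrepMatch { file: file.to_string(), line, content: content.to_string() })
}

/// filter_map drops unparseable lines; take() comes afterwards so the
/// limit counts parseable matches only.
pub fn parse_rg_output(output: &str, limit: usize) -> Vec<GrepMatch> {
    output.lines().filter_map(parse_rg_line).take(limit).collect()
}

/// Hypothetical list_dir response: `count` is capped, `total` is not.
pub struct DirListing {
    pub entries: Vec<String>,
    pub count: usize,
    pub total: usize,
    pub truncated: bool,
}

pub fn list_entries(mut names: Vec<String>, cap: usize) -> DirListing {
    let total = names.len();
    names.truncate(cap);
    DirListing { count: names.len(), total, truncated: total > cap, entries: names }
}

/// Return the lines starting at `offset`, or an explicit error when the
/// offset is past the end, instead of an empty (and misleading) result.
pub fn read_lines_from(content: &str, offset: usize) -> Result<Vec<&str>, String> {
    let lines: Vec<&str> = content.lines().collect();
    if offset > 0 && offset >= lines.len() {
        return Err(format!("offset {} is past end of file ({} lines)", offset, lines.len()));
    }
    Ok(lines[offset.min(lines.len())..].to_vec())
}
```

If `take` ran before `filter_map`, a result set with many binary-file notices could return far fewer than the limit even when more parseable matches were available.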

…y, session

commands/mod.rs had grown past 3,000 lines. Move the command handlers into four focused submodules, leaving dispatch and the command list in mod.rs:

config_cmds (/config, /backend, /model, /verbose…)
context_cmds (/context, /compact, /reset, /checkpoints)
memory (/index, /map, /memory, /remember, /forget)
session (/new, /undo, /session, /resume, /export, /path)

No behavior change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add turn_trace: every turn appends structured events (tool calls with redacted args, approvals, compaction, warmup, timing) to a sidecar at .sessions/<session-id>.events.jsonl, enabled by default via display.eventLog.enabled. API keys and sensitive object keys are redacted before anything is written.

/trace on|off surfaces nested subagent/critic tool calls as indented lines in the TUI without flooding the parent context; previously their activity was invisible (events swallowed). Tool calls now carry a depth field, and the subagent/evaluator tools forward their inner events when tracing is on.

The end-of-turn status line gains a timing breakdown (TTFT, model, tools, approval, total), the loader shows which tool is running, compaction of oversized tool output is now reported to the user with the original size, and /export <session> events copies the event log. Also prints a context pressure notice as the prompt budget nears the model's effective limit.

/export current events copies the sidecar; /new and /resume reset it to the active session.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
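As a rough illustration of the redact-then-append flow described above, the sketch below serializes one tool-call event as a JSON line and appends it to a sidecar file. The struct, the list of sensitive key markers, the "[REDACTED]" placeholder, and the JSON field names are all assumptions; the real turn_trace event schema is not shown in this PR description. The hand-rolled JSON escaping only keeps the example dependency-free.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical markers for keys whose values must never reach the log.
const SENSITIVE_MARKERS: [&str; 5] = ["api_key", "apikey", "token", "password", "secret"];

/// Hypothetical event shape: a tool call with a nesting depth and flat args.
pub struct ToolCallEvent {
    pub tool: String,
    pub depth: u32,
    pub args: Vec<(String, String)>,
}

fn is_sensitive(key: &str) -> bool {
    let lower = key.to_lowercase();
    SENSITIVE_MARKERS.iter().any(|m| lower.contains(m))
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize the event as one JSON line, redacting sensitive values first
/// so the secret never appears in the serialized form.
pub fn to_json_line(event: &ToolCallEvent) -> String {
    let args: Vec<String> = event
        .args
        .iter()
        .map(|(k, v)| {
            let shown = if is_sensitive(k) { "[REDACTED]" } else { v.as_str() };
            format!("\"{}\":\"{}\"", json_escape(k), json_escape(shown))
        })
        .collect();
    format!(
        "{{\"type\":\"tool_call\",\"tool\":\"{}\",\"depth\":{},\"args\":{{{}}}}}",
        json_escape(&event.tool),
        event.depth,
        args.join(",")
    )
}

/// Append one event to the sidecar file, creating the file if needed.
pub fn append_event(path: &Path, event: &ToolCallEvent) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{}", to_json_line(event))
}
```

Redacting during serialization, rather than filtering the file afterwards, means a crash between steps cannot leave a secret on disk. A renderer for /trace could use the depth field for indentation, for example `"  ".repeat(depth as usize)`.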

…CI job

small-harness --eval <fixture> [--model M] [--json] runs a bundled agent eval fixture from the shell and exits 0 on pass / 1 on fail, so evals can gate CI. A new optional macOS CI job runs two fixtures against Ollama nightly or when a commit message contains [eval]; it is continue-on-error so a flaky local model never blocks merges.

Add agent_integration_test: drives the real agent loop against a mock OpenAI-compatible SSE server (no live LLM) covering a tool-call round trip plus eval checks, and the hit_step_limit cutoff flag.

Two fixes surfaced while wiring this up: the rubric heading parser now matches "(weight:" case-insensitively on raw bytes instead of byte offsets from a lowercased copy (which can diverge for some Unicode chars), and the HTTP client gets a 10s connect timeout so a dead backend fails fast instead of hanging, without capping long streaming completions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
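The rubric parser fix is easier to see with a concrete case. Lowercasing a string can change its byte length (for example "İ", U+0130, is 2 bytes in UTF-8 but lowercases to 3 bytes), so an offset found in the lowercased copy may not point at the same place in the original. The sketch below searches the original bytes with an ASCII-only case-insensitive comparison instead. The function names and the assumed heading shape ("## Name (Weight: N)") are hypothetical; only the matching strategy comes from the commit message.

```rust
/// Find `needle` in `haystack`, ignoring ASCII case, and return the byte
/// offset in the ORIGINAL string. No lowercased copy is made, so the
/// offset is always valid for slicing `haystack`.
pub fn find_ascii_ci(haystack: &str, needle: &str) -> Option<usize> {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    if n.is_empty() || n.len() > h.len() {
        return None;
    }
    h.windows(n.len()).position(|w| w.eq_ignore_ascii_case(n))
}

/// Parse a heading such as "## Correctness (Weight: 2)" into (name, weight).
/// The heading format is an assumption made for this example.
pub fn parse_rubric_heading(heading: &str) -> Option<(String, f64)> {
    const MARKER: &str = "(weight:";
    let start = find_ascii_ci(heading, MARKER)?;
    // `start` points at an ASCII '(' in the original string, so both slices
    // below fall on UTF-8 character boundaries.
    let name = heading[..start].trim_start_matches('#').trim().to_string();
    let rest = &heading[start + MARKER.len()..];
    let end = rest.find(')')?;
    let weight = rest[..end].trim().parse::<f64>().ok()?;
    Some((name, weight))
}
```

For example, `parse_rubric_heading("## İstanbul tests (WEIGHT: 0.5)")` yields the name "İstanbul tests" and weight 0.5, whereas an offset taken from `heading.to_lowercase()` would be one byte too large for this input.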

morganlinton and others added 6 commits June 8, 2026 06:46

chore: release 0.7.0 — turn tracing, agent eval CLI

b80f11e

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

morganlinton merged commit 57a6175 into main Jun 10, 2026
4 checks passed

morganlinton merged 6 commits into main from fix/tool-response-quality
